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ABSTRACT 


The HMLP is an ANN similar to the MLP, but with extra weighted 
connections that connect the input nodes directly to the output nodes. The 
architecture of the HMLP neural network for implementation on FPGA is 
proposed. The HMLP architecture is designed to be concurrent to 
demonstrate the parallel nature of the HMLP where each hidden or output 
node within the same hidden or output layer of the HMLP can calculate its 
output independently. The HMLP architecture is designed to be modular as 
well, such that if modification to a module is necessary, only the specific 
module need to be modified and all other modules can be retained. This 
modularity will be especially helpful when different activation function is to 
be swapped in to replace current activation function. All calculations in the 
HMLP are performed in floating-point arithmetic. The HMLP architecture is 
compiled, simulated and finally implemented on the Cyclone V FPGA of 


VHDL DE1-SoC board. The simulation outcome and FPGA outputs showed that the 
developed HMLP architecture is able to calculate correct output values for all 
test datasets. 
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1, INTRODUCTION 

Various types of Artificial Neural Networks (ANN) had been applied for many kinds of 
applications, of which the Multilayered Perceptron (MLP) neural network is very popular. The MLP is made 
up of multiple layers of simple processing elements called nodes. The layers of MLP are one input layer, 
one output layer, and one or more hidden layers. The nodes and layers of MLP are arranged in a feedforward 
arrangement [1]. The MLP has a fully connected structure where each node is connected to every other nodes 
of next layer via weighted connections. The MLP’s hidden layers perform a non-linear mapping of the input 
layer to the output layer [2]. 

Hybrid Multilayered Perceptron (HMLP) is a type of ANN that is based on the popular MLP. 
The HMLP enhances the existing MLP’s ability by having the input layer directly connects to the output 
layer through some weighted connections. The additional weighted connections of HMLP effectively form a 
linear model in parallel to the non-linear MLP [3]. The HMLP had been applied for different applications, 
such as heart anomaly detection using electrocardiogram (ECG) data [4], classification of acute leukemia 
disease [5] and forecasting of car speed [6]. The HMLP had also been implemented on hardware, the most 
recent hardware implementation of HMLP was on a Rabbit Core Module RCM4100 microcontroller [7]. 

Field Programmable Gate Array (FPGA) is a reconfigurable digital logic device that can be used for 
implementation of various digital systems. The concurrent nature of FPGA provides engineers a suitable 
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platform to design and implement any concurrent digital system such as the MLP. A Xilinx Zynq FPGA-SoC 
device was used to implement a MLP for gas classification application in a Wireless Gas Sensor Network 
system [1] and for arrhythmia detection from electrocardiogram (ECG) signals [2]. The DE2-70 FPGA 
board, with an Altera Cyclone IJ FPGA device on board, was the platform of choice to compare the 
performance of an MLP implemented on hardware-based FPGA and a same MLP implemented on NiosII 
softcore processor on FPGA. The comparison result indicated that MLP implemented directly on FPGA 
hardware executes significantly faster than MLP which implemented on NiosII but at the expense of greater 
resource utilisation [8]. 

Based on existing works on implementation of MLP on FPGA, the prospect of using the FPGA as a 
platform to implement the HMLP seems very promising. From previous literature, no existing work on 
implementing HMLP directly on FPGA was reported [9]. This paper is a continuation of works reported in 
[9] and proposes an FPGA-based architecture for implementing the HMLP on an FPGA. 

The Introduction should provide a clear background, a clear statement of the problem, the relevant 
literature on the subject, the proposed approach or solution, and the new value of research which it is 
innovation. It should be understandable to colleagues from a broad range of scientific disciplines. 
Organization and citation of the bibliography are made in Vancouver style in sign [1], [2] and so on. 
The terms in foreign languages are written italic (italic). The text should be divided into sections, each with a 
separate heading and numbered consecutively. The section/subsection headings should be typed on a separate 
line, e.g., 1. Introduction [3]. Authors are suggested to present their articles in the section structure: 
Introduction - the comprehensive theoretical basis and/or the Proposed Method/Algorithm - Research Method 
- Results and Discussion — Conclusion. 

Literature review that has been done author used in the chapter "Introduction" to explain the 
difference of the manuscript with other papers, that it is innovative, it are used in the chapter "Research 
Method" to describe the step of research and used in the chapter "Results and Discussion" to support the 
analysis of the results [2]. If the manuscript was written really have high originality, which proposed a new 
method or algorithm, the additional chapter after the "Introduction" chapter and before the "Research 
Method" chapter can be added to explain briefly the theory and/or the proposed method/algorithm [4]. 


2. HYBRID MULTILAYERED PERCEPTRON NEURAL NETWORK 

Figure | illustrates the structure of the HMLP [3]. The HMLP consist of 3 layers, one input layer, 
one hidden layer and one output layer. Being based on the MLP, the structure of HMLP greatly resembles 
MLP’s structure with addition of several weighted connection from the input layer directly to the output 
layer. HMLP’s additional weighted links is shown as dashed lines in Figure 1. 


— —r Additonal connection 





Figure 1. Structure of HMLP compared to MLP structure [3] 


Based on the structure of HMLP, the output equation of a HMLP’s k-th output neuron, yk, with | 
hidden layer is given as (1). 


,.= hes we (d wiv; + bi) + aur wiv? + bz ;forl<k<n, (1) 
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where n;, NM, and n, are the number of input nodes, hidden nodes and output nodes of the HMLP, respectively; 
W ij W ik and w';, are the weights from input layer to hidden layer, the weights from hidden layer to output 
layer, and the weights of additional connection from input layer to output layer, respectively; b’ j and b’,, are 
bias input to the j-th hidden node and the k-th output node, respectively; and v’ is the input to the i-th input 
nodes of the HMLP. F(-) is the activation function at the hidden nodes. In this paper, the activation function 
is selected as a sigmoidal function given as (2). 


1 


Fa. (2) 

The weights W ij W ky ee and biases bi, and b*, are unknown, and should be selected to minimise 
the prediction error. Similar to the MLP, HMLP’s topology means that each node of the HMLP processes its 
respective output without interacting with other nodes of the same layer. This means that, in order to harness 
the inherent parallelism offered by the HMLP or other neural networks with MLP-like structure, 
implementing the HMLP on a concurrent system such as the FPGA will be more favourable than running the 
HMLP on sequential system such as a PC, a general purpose microprocessor or a microcontroller. 


3. DESIGN OF FPGA-BASED ARCHITECTURE OF HMLP 

The architecture of the proposed HMLP for FPGA is planned using top-down approach and is 
heavily based on HMLP’s structure in Figure 1 and HMLP’s output (1). The development of the architecture 
for the HMLP is done via bottom-up approach, whereby the lowest-level modules are described first. 
Next higher-level modules are later described level-by-level, until finally the top-level module which 
encompassed all submodules that make up the HMLP is described. 

The HMLP is designed to be modular such that if modification to a module (such as activation 
function module) is required, the modification can be performed on the target module while retaining all 
other modules of the whole structure. The HMLP architecture is designed with concurrency in mind to 
observe the inherent parallelism of a HMLP. The architecture is designed such that the hidden nodes will 
execute multiply-add operation in parallel with other hidden nodes. While hidden nodes are processing their 
outputs, the output nodes will perform multiply-add operation of the weighted input layer to output layer 
connection at the same time, as shown in Figure 2. This concurrent architecture reduces the overall 
processing time, but at the expense of increased logic element utilization. 


Sum Thaat_mecade mel _aecede 
= pea ior eed tim pee 





Tx 
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Figure 2. Timeline of hmlp_fpga execution showing the start signal, hid_node done signal, out_node done 
signal and relative processing time of hid_node and out_node 


3.1. Modules and Sub-Modules 

Figure 3 shows the modules within the top-level module hmlp_fpga that make up the HMLP 
architecture for FPGA implementation and all major internal connections. The modules that make up 
hid_node and out_node are shown in Figure 4. 

Based on HMLP’s structure, the hidden and output nodes, represented by the modules hid_ node and 
out_node respectively, execute most arithmetic operations of the HMLP. Both hid_node and out_node 
modules comprise of a multiply-add module and activation function module. 

All nodes of a given layer are connected to adjacent layers via weighted connections. 
These connections are described in higher level modules, which mainly operate as to route the input signals, 
weights, biases and other intermediate signals to the correct hidden or output node. 
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Figure 3. Modules within top level module hmlp_fpga 


Figure 4. Lower-level modules within hid_node and out_node 


Low-level modules hid_madd and out_madd are described to perform multiplication of the input 
signals to its respective weights and then to sum the input-weight products at the HMLP’s hidden and output 
nodes, respectively. These modules will multiply its input-weight pair one-by-one and accumulate all product 
of the multiplication. Module out_madd perform almost similar operation to hid_madd but with the extra 
processing of weighted connection from input layer to output layer of the HMLP. Floating-point 
multiplication IP and addition IP are used to implement the multiplication and summation operations. A state 
machine controls the overall operation of the hid_madd and out_madd modules. The accumulated value is 
then passed to next module that applies the activation function to the accumulated value. 

The summed values from hid_madd and out_madd are passed to next modules, afunc_sigmoid and 
afunc_linear. Both are the modules describing the operation of the sigmoidal activation function at the hidden 
nodes and the linear activation function at the output nodes respectively. These activation function modules 
are described as separate module from the multiply-add modules because if different activation function is 
required, the new activation function can be swapped in while all other modules can be retained. 
In afunc_sigmoid module, the sigmoidal activation function is implemented using floating-point IPs for 
exponent, addition and inversion operations with a state machine to control the flow of data through the 
afunc_sigmoid module and generate necessary control signals. In afunc_linear module, no arithmetic 
operation is performed, only a state machine is described to control the flow of data through the afunc_linear 
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module and generate necessary control signals. The activation function’s output from afunc_sigmoid or 
afunc_linear module will be the output of individual hid_node or out_node module. 

Next higher-level module described the HMLP’s hidden node and output node. Module hid node 
encapsulates hid_madd and afunc_sigmoid modules that perform the operation of the hidden nodes. 
Likewise, out_node module encapsulates out_madd and afunc_linear that performs the operation of output 
nodes of the HMLP. 

The hid_layer and out_layer modules repetitively generate the hid_node module and out_node 
module that formed the HMLP’s hidden layer and output layer, respectively. Top level module hmlp_ fpga 
connects all modules, input signals, output signals and intermediate signals to implement the HMLP structure 
as in Figure 1 on FPGA. 


3.2. Floating Point Number Representation 

Real numbers representation is crucial and on an FPGA, all real numbers need to be described in 
binary. In this architecture, floating point number representation is utilised to reduce development time 
instead of describing a unique number representation. Using standardised number representation, such as the 
TEEE754 standard [10] that define the bit ordering of 32-bit single precision floating-point numbers, 
allow for better integration with other devices, avoid potential confusion when communicating with different 
systems, and makes custom designs more adaptable for future developments. 

Arithmetic operations on the floating-point numbers is done through the use of Floating-Point 
Arithmetic IPs from Altera (now Intel FPGA) that meets most of the IEEE754 standard [11]. These floating- 
point arithmetic IPs are used in the hid_madd, out_madd and afunc_sigmoid modules to perform operations 
of multiplication, addition, exponent, and inversion. 


4. SIMULATION 

The architecture of the HMLP is described in VHDL and compiled on Altera Quartus II version 
15.0 Web Edition. The architecture is then simulated using ModelSim 10.3d software. A VHDL testbench 
tb hmlp fpga is described to provide the necessary stimuli to the HMLP’s top level module hmlp_fpga. 
For the simulation, inputs to hmlp_fpga are set by tb_hmlp_fpga and the outputs of hmlp_fpga are observed 
from the simulation waveform. The inputs and weights data for hmlp fpga’s simulation is generated using 
MATLAB. Figure 5 shows the ModelSim simulation output of hmlp_fpga for n; = 3, n, = 3, and n, = 2. 
The start signal, hid_layer done and final done signal is pointed out by its respective cursor. 





Figure 5. Simulation waveform of the designed HMLP architecture 


The simulation is repeated three times with different n;, n, and n, combinations. For each n;, n, and 
n, combination, five different input datasets are randomly generated using MATLAB and coded into the 
simulation testbench tb_hmlp_fpga. Simulation output is compared to the result computed by MATLAB. 
The simulation shows that hmlp_fpga produced similar outputs to MATLAB results. Some deviation 
between simulation output and MATLAB result do exist, but the difference is very small and can be 
considered to be negligible. 
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5. IMPLEMENTATION ON FPGA 

After the simulation stage, the designed HMLP architecture is then compiled for the target FPGA 
development board. All compilation, netlist generation, place and route, and generation of configuration 
bitstream are done using Quartus IH. For the purpose of testing the HMLP architecture on the FPGA, 
a synthesisable testbench is described to assign the inputs and weights data to the HMLP and display the 
output data in human readable form. 


5.1. FPGA Device/Development Doard 

The target FPGA development board for testing the HMLP architecture is the DE1-SoC from 
Terasic. The DEI-SoC has an Altera S;CSEMASF31C6 FPGA chip as the main FPGA device and other 
necessary circuitries for the Cyclone V FPGA-SoC to function. DE1-SoC is also equipped with various 
input/output components and devices for users to explore the FPGA-SoC [12]. 

The on-board slide switches, push-button switches, LEDs, 7-segment displays and clock generator 
are used to test the designed HMLP architecture. The synthesisable testbench takes user inputs from the 
switches and pushbuttons, assigns the HMLP’s test inputs and weights to the hmlp fpga module, and lastly 
displays the HMLP outputs from hmlp_fpga module accordingly on the 7-segment displays and LEDs. 


5.2. Compilation Report 

A full compilation report of the hmlp_fpga module and the synthesisable testbench for the Cyclone 
V FPGA 1s shown by Altera Quartus II after successful compilation. Useful details such as the target FPGA’s 
resource utilisation can be extracted from the compilation report. Overall, resource utilisations of the 
designed HMLP architecture including the synthesisable testbench are shown in Table 1. 


Table 1. CycloneV SE Resource Utilisation taken from Quartus If Compilation Report 


HMLP ALM Total Total block Total DSP HMLP Compute time 

structure register memory bits block 
nj = 3 
Ny, = 3 Ei akirby 8576 eT Oe 80 53/87 (61%) 115 clock cycles (2.3 us on 50 MHz clock) 
es (17%) (0%) 
nj = 8 
oe. Se a 8683 Seco ee oaey 53/87 (61%) 150 clock cycles (3.0 us on 50 MHz clock) 
n= (17%) (0%) 
nj = 3 
pea | 11053 7814069280" 9 787 (80%). 157 Glock cyelés (3:14 jis 6n 50 MHz clock) 
a (22%) (0%) 
nj = 8 
ae. oe 8585 papa eoee0 53/87 (61%) 150 clock cycles (3.0 us on 50 MHz clock) 
= (17%) (0%) 


For a HMLP with greater n;, n, or n,, it’s FPGA resource utilisation will be greater. Greater n; will 
add to compute time but has minimal impact on resource utilisation; increasing n, will add compute time and 
uses more resource; whereas increasing n, has no impact on compute time but increases resource usage. 

The ALM stand for Adaptive Logic Module in Altera FPGAs [13]. In this design, most of the FPGA 
resources is used up by the Altera’s Floating-Point IPs, especially the IPs for exponent and inversion 
operation to compute the afunc_sigmoid sigmoid activation function module at the hidden nodes. This is 
proven by the significant increment of resource utilisation when ny, is increased from 3 to 4 as in Table 1. 


5.2. FPGA Implementation Outcome 
From the simulation and FPGA implementation outcomes, the time needed for the HMLP 
architecture to compute its output value, in term of number of clock cycles, 1s determined by (3). 


hmlp_fpga output latency = 4+ 69 + 7n; + 7n;, (4) 


Based on (3), the afunc_linear module takes 4 clock cycles to produce stable result at its output port. 
69 clock cycles are the duration needed by afunc_sigmoid module to finish its operation and produce stable 
result on its output port. 7n; and 7n, are the output latency of the hid_madd and out_madd modules 
respectively. The output latency of these multiply-add modules varies depending on n; and n;, because 
multiply-add is a sequential operation, thus increasing n; or n, will add to the overall output latency of the 
hmlp_fpga module. 
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For a HMLP with n; = 3, n, = 3, n, = 2, and system clock of 50 MHz, the time needed for the HMLP compute 
its output is 2.3 us (115 clock cycles). 

The concurrent structure of the HMLP allow for all nodes to process its output concurrently, 
thus reducing the overall processing time. The processing time for multiply-accumulate operation performed 
by the hid_madd and out_madd modules is greatly affected by the number of input to the respective module. 
This is because multiply-add operation is a sequential operation. More inputs hid_madd and out_madd have, 
its processing time will be greater. To reduce the processing time of hid_madd and out_madd modules, 
the multiply operation and add operation is pipelined, such that these modules can calculate the product of 
next input-weight pair while adding and accumulating previous multiplication product as shown in Figure 2. 

As in the simulation, the compilation and FPGA implementation of the designed HMLP architecture 
is repeated three times with different n;, n, and n, combinations. For each n;, n, and n, combination, 
five different input datasets are used. The datasets used for FPGA implementation are the same dataset used 
for simulation. The outcome of the designed HMLP architecture on FPGA is compared with the simulation 
result. The output value from the FPGA implementation is found to be the same as its simulation result. 


6. CONCLUSION 

This paper had described the development of architecture of HMLP neural network for 
implementation on Cyclone V FPGA. The HMLP architecture is designed to be concurrent and modular. 
The proposed architecture is compiled using Altera Quartus II and simulated on ModelSim. The simulation 
reveals that the proposed HMLP architecture is able to produce desired result. Next, HMLP is implemented 
on the Cyclone V FPGA device on board DE1-SoC development and education kit. The FPGA produced 
same outputs as compared to the simulation and MATLAB. 
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